BACKGROUND OF THE INVENTION
Field of the Invention

The present invention relates to a technique for
measuring the importance of words or word sequences in a
group of documents, and is intended for use in supporting
document retrieval and automatic construction of a word
dictionary, among other purposes.

Description of the Related Art 

Fig. 1 illustrates a document retrieval system having
windows for displaying "topic words" in the retrieved
documents, wherein the window on the right side selectively
displays words in the documents displayed in the window on
the left side. An example of such a system is disclosed
in Japanese Published Unexamined Patent Application No.
Hei 10-74210, "Document Retrieval Supporting Method and
Document Retrieving Service Using It" (Reference 1).

Kyo Kageura et al., "Methods of automatic term
recognition: A review," Terminology, 1996 (Reference 2)
describes a method for calculating the importance of words.
Methods to calculate the importance of words have long been
studied with a view to automatic term extraction or to
facilitating literature searching by weighting words
characterizing the desired document.

Words may be weighted either to extract important
words from a specific document or to extract important words
from all documents. The best-known measure for the
former is tf-idf, where idf is the logarithm of the total
number N of documents divided by the number N(w) of
documents in which a certain word w occurs, and tf is the
frequency of occurrence f(w, d) of the word in a document d.
tf-idf, as the product of these factors, is represented by:

f(w, d) x log2(N/N(w))

There are variations, including the following, which uses
the square root of f(w, d):

f(w, d)**0.5 x log2(N/N(w))

Although there are many other variations, tf-idf is
designed, as its basic nature, to become "greater as the
word occurs more frequently and concentrates in a smaller
number of documents."
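
The two per-document formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name tf_idf, the use_sqrt flag and the toy documents are assumptions made for the example and do not come from Reference 2.

```python
import math
from collections import Counter

def tf_idf(docs, use_sqrt=False):
    """Return {doc_index: {word: score}} for a list of tokenized documents.

    score = f(w, d) x log2(N/N(w)), or f(w, d)**0.5 x log2(N/N(w))
    when use_sqrt is True (hypothetical flag for the square-root variant).
    """
    n_docs = len(docs)  # N
    # N(w): number of documents in which w occurs at least once
    doc_freq = Counter(w for doc in docs for w in set(doc))
    scores = {}
    for i, doc in enumerate(docs):
        counts = Counter(doc)  # f(w, d)
        scores[i] = {
            w: (math.sqrt(f) if use_sqrt else f) * math.log2(n_docs / doc_freq[w])
            for w, f in counts.items()
        }
    return scores

# Toy corpus (made up for illustration).
docs = [
    ["bank", "rate", "rate", "do"],
    ["bank", "merger", "do"],
    ["cult", "trial", "do", "do"],
]
scores = tf_idf(docs)
# "do" occurs in every document, so log2(N/N(w)) = 0 and its score is 0.
```

In this toy corpus "rate" receives the highest score in the first document because it is both frequent there and absent elsewhere, which matches the basic nature quoted above.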

Though not stated in Reference 2, a natural way to
extend this measure from the importance of a word in a
specific document to the importance of the word in the set
of all documents is to replace f(w, d) with f(w), the
frequency of w in all documents.

One of the methods to extract important words from all
documents is to measure the accidentalness of differences
in the frequency of occurrence of each word from one given
document category to another, and to qualify as important
the words that have a higher degree of non-accidentalness.
The accidentalness of differences can be measured by several
measures including the chi-square test, and this method
requires the categorization of the document set in advance.

In a context separate from these studies, there is
a series of attempts to identify a collection of words (or
word sequences) which qualify as important words (or word
sequences) from the standpoint of natural language
processing. In these studies, methods have been proposed
by which words (or word sequences) to be judged as important
are restricted by the use of grammatical knowledge
together with the intensity of the co-occurrence of
adjoining words assessed by various measures. Such
measures include (pointwise) mutual information,
the log-likelihood ratio and so forth.



BRIEF SUMMARY OF THE INVENTION
Techniques so far used involve the following
problems: (1) tf-idf (or its like) is not accurate enough;
the contribution of the frequency of a word empirically
tends to be too large, making it difficult to exclude such
overly common stop-words as "do"; (2) a method that compares
differences in the distribution of a specific word among
categories requires the classification of documents in
advance, and this requirement generally is not satisfied; (3)
a method that utilizes the intensity of co-occurrence between
adjoining words cannot evaluate the importance of a single
word, and it is not easy to extend such a method so that it
can treat a word sequence containing n words (n>2); and (4)
the setting of a threshold value for selecting important
words has been difficult and apt to be ad hoc. An object
of the present invention is to provide a method free from
such problems.

In the following description, a "term" means a word or
a word sequence. To paraphrase the "importance of a term"
from the viewpoint of term extraction or information
retrieval, that a given term is important means that the term
indicates or represents a topic (or topics) of some
significance; in other words, the term is informative or
domain-specific. In the following, such a term is said to
be "representative," and in this context the "importance" of
a term is also called the representativeness of a term.
Since such a term is likely to be useful in taking an overview
of the contents of a document set, it is important in
information retrieval or a support system therefor.



In measuring the degree of representativeness,
conventional methods take into account only the distribution
of the pertinent term itself. However, a method like tf-idf,
though simple, is not accurate enough, and a method using a
statistic such as chi-square has difficulty obtaining
statistically significant values for most terms because,
except in rare cases, the frequency of a term is too low for
such a statistical test to be applied properly; this results
in low precision.

The present invention takes note not of the
distribution of a specific term but of the distribution of
words occurring in association with the term noted. This is
based on a working hypothesis that "the representativeness
of a term is related to the unevenness of the distribution
of words occurring together with the term" and that saying a
given term is "representative" means that "the distribution
of words occurring with the term is characteristic."

Therefore, the present invention uses, in calculating
the representativeness of a word W, the difference between
the word distribution in D(W), the set of documents which
consists of every document containing W, and the word
distribution in the whole documents from which said D(W)
derives. In particular, the characteristic feature is that
the difference is determined by comparing two distances, d
and d'. Here, d is the distance between said D(W) and the
whole documents, and d' is the distance between a randomly
selected subset of documents containing substantially the
same number of words as said D(W) and the whole documents,
where the concept of "distance between two documents"
includes the distance between two word distributions: that
in one document set and that in another.

Other and further objects, features and advantages of
the invention will appear more fully from the following
description.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS 

A preferred form of the present invention is 
illustrated in the accompanying drawings in which: 

Fig. 1 shows an example of an information retrieval
support system having a window to display topic words;

Fig. 2 shows an example of distance between two word
distributions;

Fig. 3 shows a hardware configuration for realizing
a proposed word importance calculation method;

Fig. 4 shows the configuration of a
representativeness calculation program;

Fig. 5 shows an example of configuration for use in
applying representativeness to displaying of retrieved
documents in support of document retrieval;

Fig. 6 shows an example of configuration for use in
applying representativeness to automatic word extraction;

Fig. 7 is a graph of results of an experiment showing
how the proposed word importance raises the ranks of words
considered suitable for summarizing the results of
retrieval in comparison with other measures; and

Fig. 8 is a graph of results of an experiment showing
how the proposed word importance lowers the ranks of words
considered unsuitable or unnecessary for summarizing the
results of retrieval in comparison with other measures.



DETAILED DESCRIPTION OF THE INVENTION
The present invention will be described in detail
below.

First, the reference signs used in describing the
invention will be explained: 301 denotes a storage; 3011,
text data; 3012, a morphological analysis program; 3013, a
word-document association program; 3014, a word-document
association database (DB); 3015, a representativeness
calculation program; 3016, a representativeness DB; 3017,
a shared data area; 3018, a working area; 302, an input
device; 303, a communication device; 304, a main memory;
305, a CPU; 306, a terminal device; 4011, a module for
calculating background word distribution; 4012, a module for
calculating baseline function; 4013, a document extraction
module; 4014, a module for calculating co-occurring word
distribution; 4015, a module for calculating distance
between word distributions; 4016, a module for normalizing
distance between word distributions; 4017, a random
sampling module; 544, a topic words displaying routine;
5441, a topic words extraction routine; 5442, a
co-occurrence analysis routine; 5443, a graph mapping routine;
5444, a graph displaying routine; 601, a storage device;
6011, text data; 6012, a morphological analysis program;
6013, a word-document association program; 6014, a
word-document association database; 6015, a database for
extracted words; 6016, a working area; 6017, a
representativeness calculation program; 6018, a
representativeness DB; 6019, a shared data area; 601A, a
program for extracting word sequences; 601B, a program for
grammatical filtering; 601C, a filtering program; 602, an
input device; 603, a communication device; 604, a main
memory; 605, a CPU; and 606, a terminal device consisting
of a display, a keyboard and so forth.

The following description will concern a method for
assessing the representativeness of any term and its
application to an information retrieval system. First,
measures for assessing the representativeness of a term are
introduced by mathematically rephrasing the idea stated in
BRIEF SUMMARY OF THE INVENTION above. Thus, with respect
to any term W (word or word sequence), note is taken of the
word distribution in D(W), the set of documents that
consists of every document containing the term W, and the word
distribution in the whole documents. More specifically,
Rep(W), the representativeness of W, is defined on the basis
of Dist{PD(W), P0}, the distance between two distributions
PD(W) and P0, where D0 is the set of the whole documents;
PD(W), the word distribution in D(W); and P0, the word
distribution in D0.

Many methods of measuring the distance
between word distributions are conceivable, the principal
ones including (1) the log-likelihood ratio, (2)
Kullback-Leibler divergence, (3) transition probability
and (4) the vector-space model (cosine method); it has been
confirmed that steady results can be obtained by using, for
instance, the log-likelihood ratio. The distance between
PD(W) and P0, using the log-likelihood ratio, is defined
below, where {w1, ..., wn} represents all words, and ki and
Ki, the frequencies of occurrence of a word wi in D(W)
and D0, respectively.

Numerical expression 1:

\[
\mathrm{Dist}\{P_{D(W)}, P_0\} = \sum_{i=1}^{n} k_i \log \frac{k_i}{\#D(W)} - \sum_{i=1}^{n} k_i \log \frac{K_i}{\#D_0}
\]
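
As a rough illustration, the log-likelihood-ratio distance of Numerical expression 1 can be computed directly from two frequency tables. In the sketch below, the function name dist_llr, the use of the natural logarithm and the toy counts are assumptions made for the example; #D and #D0 are taken to be the total numbers of words in the subset and in the whole documents, and a term with ki = 0 is treated as contributing 0.

```python
import math
from collections import Counter

def dist_llr(sub_counts, all_counts):
    """Sum_i ki*log(ki/#D) - Sum_i ki*log(Ki/#D0) for word-frequency tables.

    sub_counts: word -> ki, frequencies in a document subset D
    all_counts: word -> Ki, frequencies in the whole documents D0
    The subset is assumed to be drawn from D0, so Ki > 0 whenever ki > 0.
    """
    n_sub = sum(sub_counts.values())  # #D
    n_all = sum(all_counts.values())  # #D0
    total = 0.0
    for w, k in sub_counts.items():
        if k == 0:
            continue  # 0 * log 0 is treated as 0
        total += k * (math.log(k / n_sub) - math.log(all_counts[w] / n_all))
    return total

# Toy frequencies (made up for illustration).
all_counts = Counter({"do": 6, "bank": 2, "cult": 2})
sub_counts = Counter({"do": 1, "cult": 2})
d = dist_llr(sub_counts, all_counts)
```

The value is 0 when the subset has the same word distribution as the whole documents and grows as the subset's distribution becomes more skewed.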



Fig. 2 displays words at the coordinates
(#D(W), Dist{PD(W), P0}), where W varies over said words
and #D(W) is the number of words in D(W), and it also
plots the coordinates (#D, Dist{PD, P0}), where D
varies over randomly selected document sets. The
displayed words and the document sets are taken from
articles in the 1996 issues of a financial newspaper, Nihon
Keizai Shimbun.

As is seen in Fig. 2, comparison of Dist{PD(W1), P0}
and Dist{PD(W2), P0} is consistent with what human intuition
tells us when #D(W1) and #D(W2) are close to each other. For
instance, "USA" has a higher value of Dist{PD(W), P0} than
"suru" (do), and so does "Aum", which is the name of an
infamous cult, than "combine". However, a pair of terms
whose #D(W) values differ widely (which means that there is
a large difference between the frequencies of the two terms)
cannot be appropriately compared in terms of
representativeness, because Dist{PD(W), P0} usually
increases as #D(W) increases. Actually, "Aum" and "suru"
are about equal in Dist{PD(W), P0}, which is against human
linguistic intuition. Then, in order to offset this
intrinsic behavior of Dist{•, P0}, the points
(#D, Dist{PD, P0}) plotted in Fig. 2 using "x" marks are to
be investigated. These points are likely to be well
approximated by a single smooth curve beginning at (0, 0)
and ending at (#D0, 0). This curve will be hereinafter
referred to as the baseline curve.

Whereas it is evident that by definition Dist{PD, P0}
is 0 when D is empty and when D = D0, it has been confirmed
that the behavior of the baseline curve in the neighborhood
of (0, 0) is stable and similar across corpora when the size
of the whole documents varies over a broad range (say, from
about 2,000 documents to a full-year total of newspapers
amounting to about 300,000 documents).

Then, an approximating function B(•) is figured out
in a section (1000 < #D < 20000) where the baseline curve
can be approximated with steadily high accuracy using an
exponential function, and the level of representativeness
of W satisfying the condition of 1000 < #D(W) < 20000 is
defined by the value Rep(W) = Dist{PD(W), P0}/B(#D(W)), that
is, a value obtained by normalizing Dist{PD(W), P0} with
B(•). (It has to be noted that the "words" in this context
are already cleared of all those which are considered
certain to be unnecessary as query terms for information
retrieval, such as symbols, particles and auxiliary verbs.
While the same method can be realized even if these elements
are included, in that case there will be some changes in the
above-cited numerals.)
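
The normalization can be illustrated with a small sketch. The text does not give the parametric form of the "exponential function", so the sketch below assumes, purely for illustration, B(x) = a * exp(b * x) fitted by ordinary least squares on log(Dist); the names fit_exponential, make_baseline and rep, and the synthetic points, are likewise hypothetical. Any other fitted form could be substituted for the baseline without changing rep.

```python
import math

def fit_exponential(points):
    """Least-squares fit of log(y) = log(a) + b*x over (x, y) points; returns (a, b).

    Assumed model (not stated in the text): y = a * exp(b * x), with y > 0.
    """
    xs = [x for x, _ in points]
    logs = [math.log(y) for _, y in points]
    n = len(points)
    mean_x, mean_l = sum(xs) / n, sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxl / sxx
    a = math.exp(mean_l - b * mean_x)
    return a, b

def make_baseline(points):
    """Return B(.), a function from a number of words to a positive real."""
    a, b = fit_exponential(points)
    return lambda n_words: a * math.exp(b * n_words)

def rep(dist, n_words, baseline):
    """Rep(W) = Dist{PD(W), P0} / B(#D(W))."""
    return dist / baseline(n_words)

# Synthetic (#D, Dist) points in the section 1000..20000 (made up).
points = [(x, 100.0 * math.exp(1e-4 * x)) for x in range(1000, 20001, 1000)]
B = make_baseline(points)
value = rep(2 * B(5000), 5000, B)  # a distance twice the baseline gives Rep of about 2
```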

With a view to making it possible to use the
well-approximated region of the aforementioned baseline function
even where #D(W) is significantly great, as in the case of
"suru", and to reducing the amount of calculation, about 150
documents are extracted at random from D(W) to form a set
denoted D'(W), so that 20,000 < #D'(W) holds, and Rep(W) is
calculated using D'(W) instead of D(W).

On the other hand, as the approximating function of
the baseline curve figured out in the aforesaid section
tends to overestimate the value in {x | 0 < x < 1000}, Rep(W)
is likely to be underestimated for W in the range of #D(W)
< 1000 as a result of normalization. However, since 1000
words approximately correspond to two or three newspaper
articles, and terms which occur in only that number of
documents are not very important for our purpose, the
calculated result was applied as it was. Of course, another
baseline may as well be calculated in advance. Dist{PD,
P0}/B(#D) in the randomly sampled document set D steadily
gave an average, Avr, of approximately 1 (±0.01) and a
standard deviation σ of around 0.05 in various corpora.
Since the maximum never surpassed Avr + 4σ, a threshold
value of Avr + 4σ = 1.20 is provided as the basis for
judging whether the Rep(W) value of a given term is "a
meaningful value" or not.
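
The threshold rule can be sketched as follows. The function names rep_threshold and is_meaningful are hypothetical, the sample values are made up, and the choice of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def rep_threshold(normalized_random, k=4.0):
    """Avr + k*sigma over Dist{PD, P0}/B(#D) values of randomly sampled sets."""
    avr = statistics.mean(normalized_random)
    sigma = statistics.pstdev(normalized_random)  # assumed estimator
    return avr + k * sigma

def is_meaningful(rep_value, threshold):
    """True when a term's Rep value reaches the threshold."""
    return rep_value >= threshold

# Made-up normalized distances for four random document sets.
threshold = rep_threshold([0.95, 1.0, 1.05, 1.0])
```

With the figures reported above (Avr of about 1 and σ of about 0.05), the same rule gives 1 + 4 x 0.05 = 1.20.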

The above-cited measure Rep(•) has such desirable
features that (1) its definition is mathematically clear,
(2) it allows comparison of highly frequent terms and
infrequent terms, (3) the threshold value can be defined
systematically, and (4) it is applicable to terms consisting
of any number of words.

The effectiveness of the measure Rep(•) proposed in
the present invention has been confirmed by experiments as
well. Out of words which occurred three times or more in
total in the articles in the 1996 issues of the Nihon Keizai
Shimbun, 20,000 words were extracted at random, and 2,000
out of them were manually classified into three categories
according to whether their occurrence in the overview of
retrieved contents is "desirable" (class "a"), "neither
desirable nor undesirable" or "undesirable" (class "d").
The 20,000 words are ranked by a measure, and the number of
words which are classified into a specified class and appear
between the first word and the Nth word, which number is
hereafter called the "accumulated number of words", is
compared to that obtained by using another measure. In the
following, four measures will be used: random (i.e., no
measure), frequency, tf-idf and the proposed measure. The
tf-idf used here is the version covering all documents,
which was explained in the section on the prior art. Thus it
is defined as f(w)**0.5 x log2(N/N(w)), where N is the number
of all the documents, N(w) is the number of documents in
which w appears, and f(w) is the frequency of w in all the
documents.
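
The "accumulated number of words" used in this evaluation can be sketched as a running count over a ranked list. The function name accumulated_counts, the example words and their labels are assumptions for illustration; words without a manual label are simply not counted.

```python
def accumulated_counts(ranked_words, labels, target):
    """For N = 1..len(ranked_words), count words labelled `target` in the top N.

    ranked_words: words sorted by some measure, best first
    labels: word -> class label ("a", "d", ...) for the manually classified subset
    """
    counts, running = [], 0
    for word in ranked_words:
        if labels.get(word) == target:
            running += 1
        counts.append(running)
    return counts

# Hypothetical scores and labels; only some words carry a label.
scores = {"aum": 3.1, "usa": 2.4, "do": 0.9, "this": 0.8}
labels = {"aum": "a", "usa": "a", "do": "d"}
ranked = sorted(scores, key=scores.get, reverse=True)
curve_a = accumulated_counts(ranked, labels, "a")
curve_d = accumulated_counts(ranked, labels, "d")
```

A measure that ranks class "a" words early makes curve_a rise quickly, while a measure that pushes class "d" words down keeps curve_d flat near the top of the list.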

Fig. 7 compares the accumulated numbers of words
classified as "a". As is evident from the graph, the power
to raise the ranks of words classified as "a" becomes stronger
in the order of random < frequency < tf-idf < proposed
measure. The improvement is evidently significant. Fig.
8 compares the accumulated numbers of words classified as
"d"; the superiority of the proposed measure in sorting
capability is distinct. Frequency and tf-idf are no
different from the random case, revealing their inferiority in
"stop-word" identifying capability. In view of these
findings, the measure proposed according to the invention
is particularly effective in identifying stop-words, and is
expected to be successfully applied to the automatic
preparation of stop-word lists and to the improvement of the
accuracy of weighting in the calculation of document
similarity by "excluding frequent but non-representative
words".

An example of system configuration for the
calculation of representativeness so far described is
illustrated in Fig. 3. Calculation of representativeness
will now be described below with reference to Figs. 3 and
4, in which 301 denotes a storage for storing document data,
various programs and so forth using a hard disk or the like.
It is also utilized as a working area for programs.
Further, 3011 denotes document data (although Japanese
is used in the following example, this method is not
language-specific); 3012, a morphological analysis program
for identifying words constituting a document (it performs
such processing as word separation by spaces and
part-of-speech tagging in Japanese, or stemming in English; no
particular method is specified, and various systems have
been disclosed for both languages, whether for commercial
use or research purposes); 3013, a word-document association
program (for checking, according to the results of
morphological analysis, which word occurs in which document
and how often, or conversely which words occur how many
times in which document; basically this is a task to fill
the elements of a matrix having words as rows and documents
as columns by counting, and no particular method is
specified for this task); 3014, a word-document association
database (DB) for recording word-document association data
calculated as described above; 3015, a representativeness
calculation program, a program for calculating the
representativeness of a term, whose details are shown in
Fig. 4; 3016, a DB for recording the calculated
representativeness of terms; 3017, an area for a plurality
of programs to reference data in a shared manner; 3018, a
working area; 302, an input device; 303, a communication
device; 304, a main memory; 305, a CPU; and 306, a terminal
device consisting of a display, a keyboard and so forth.
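
The word-document association described for 3013 and 3014 amounts to filling a sparse words-by-documents count matrix. The sketch below is one possible in-memory representation, not the format of the DB 3014; the names build_association and documents_containing and the toy documents are assumptions, and the documents are assumed to be already split into words by the morphological analysis step.

```python
from collections import Counter, defaultdict

def build_association(docs):
    """Return word -> Counter({doc_id: count}) for tokenized documents.

    Each entry is one non-zero cell of the words-by-documents matrix.
    """
    assoc = defaultdict(Counter)
    for doc_id, tokens in enumerate(docs):
        for word in tokens:
            assoc[word][doc_id] += 1
    return assoc

def documents_containing(assoc, word):
    """Set of document ids in which `word` occurs (empty if unseen)."""
    return set(assoc.get(word, {}))

# Toy tokenized documents (made up for illustration).
docs = [["do", "bank"], ["bank", "bank"], ["do", "do", "cult"]]
assoc = build_association(docs)
```

Reading a row answers "in which documents does this word occur, and how often"; iterating over all rows for a fixed document id answers the converse question.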

Fig. 4 illustrates details of the representativeness
calculation program 3015. The method of calculating the
representativeness of a specific term by using this program
will be described below. In the figure, 4011 denotes a
module for calculating background word distribution. This
module is used only once, and records the frequency of each
word in the whole documents. Thus, with all words being
represented by {w1, ..., wn} and Ki denoting the frequency
of occurrence of a word wi in the whole documents D0, as
is the case with Numerical expression 1, (K1, ..., Kn) is
recorded. Reference numeral 4012 denotes a module for
estimating the baseline function with regard to given
document data. This module, too, is used only once at the
beginning. It can be realized by combining the following
basic elements: (1) When the whole document set is given,
document sets whose numbers of words range from around
1000 to around 20,000 are selected at random repeatedly, and
at each repetition, the distance between the word
distribution in the selected document set and the word
distribution in the whole documents obtained by 4011 is
calculated using Numerical expression 1. (2) The baseline
function B(•) is figured out from the points
(#D, Dist{PD, P0}) by the least squares method or the like,
where D varies over the document sets randomly selected in
(1) and (#D, Dist{PD, P0}) has been calculated for each D in
(1). B(•) is a function from the number of words to a
positive real number. No particular method is specified for
this approximation; standard methods are available.

Reference numeral 4013 denotes a document extraction
module. When a term W = wn1 ... wnk is given, a document set
D(wni) (1 ≤ i ≤ k) is obtained from the word-document
association DB 3014, and the intersection of all D(wni)
(1 ≤ i ≤ k) is taken to determine D(W). If the
word-document association DB 3014 records the information on
the position of a word in every document, the set of all
documents containing the term W = wn1 ... wnk can be obtained,
which is a subset of the intersection of all D(wni)
(1 ≤ i ≤ k). If the word-document association DB 3014 does
not record the information on the position of a word in the
document, the intersection of all D(wni) (1 ≤ i ≤ k) is
taken as D(W) as an approximation. Numeral 4014 denotes a
module for calculating co-occurring word distribution. Again
the frequency of each word in D(W) is counted from the
word-document association DB 3014 to determine the frequency
ki of wi in D(W) (1 ≤ i ≤ n). Numeral 4015 denotes a module
for calculating distance between word distributions. Using
Numerical expression 1 and the word frequencies obtained by
4011 and 4014, the distance Dist{PD(W), P0} between the word
distribution in the whole documents and the word
distribution in D(W) is calculated. Numeral 4016 denotes
a module for normalizing the aforementioned distance
Dist{PD(W), P0}. Using the number of words in D(W), which
is denoted #D(W), and B(•) obtained by 4012, it calculates
the representativeness of W as Rep(W) = Dist{PD(W),
P0}/B(#D(W)). Numeral 4017 denotes a random sampling
module, which is used in 4013 to select a predetermined
number of documents when the number of documents contained
in D(W) surpasses a predetermined number (recorded in the
shared data area 3017). While in this instance the number
of documents is used as the predetermined number, it is also
possible to use a desired number of words as the
predetermined number and to make the number of words in the
randomly sampled documents as close to the predetermined
number as possible.
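
The document extraction step (4013) together with the random sampling step (4017) can be sketched as below for the case where word positions are not recorded, so that the intersection of the per-word document sets is used as an approximation of D(W). The function name extract_documents, its parameters and the toy postings are hypothetical.

```python
import random

def extract_documents(term_words, postings, max_docs=None, rng=None):
    """Approximate D(W) as the intersection of D(w) over the words of the term.

    term_words: the words wn1 ... wnk making up the term W
    postings: word -> set of ids of documents containing that word
    max_docs: optional cap; a larger D(W) is randomly sampled down to this size
    rng: optional random.Random instance, for reproducible sampling
    """
    doc_sets = [set(postings.get(w, ())) for w in term_words]
    d_w = set.intersection(*doc_sets) if doc_sets else set()
    if max_docs is not None and len(d_w) > max_docs:
        rng = rng or random.Random()
        d_w = set(rng.sample(sorted(d_w), max_docs))
    return d_w

# Toy postings (made up for illustration).
postings = {"interest": {0, 1, 2, 5}, "rate": {1, 2, 3, 5}}
d_full = extract_documents(["interest", "rate"], postings)
d_sampled = extract_documents(["interest", "rate"], postings,
                              max_docs=2, rng=random.Random(0))
```

If positional information were available, the intersection would serve only as a candidate set to be narrowed to documents in which the words actually occur as a sequence.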

Fig. 5 shows an example of configuration for the
application of the invention to assisting document
retrieval. This diagram illustrates the configuration of a
retrieving apparatus where the invention is applied to the
displaying of topic words in a navigation window in line with
the configuration shown in Fig. 1 of the document retrieval
support method according to Reference 1. It differs from
the document retrieval support method according to
Reference 1 in that, in a topic words displaying routine 544,
a representativeness check routine 5445 is added, and in a
topic words extraction routine 5441, a co-occurrence
analysis routine 5442, a graph mapping routine 5443 and a
graph displaying routine 5444, the representativeness check
routine is used. The representativeness check routine is
a routine to return the representativeness of each word in
the set of the whole documents. It is possible to calculate
in advance the representativeness of each word according to
the program shown in Fig. 4.

When the user enters a retrieval keyword from a
keyboard 511, the titles of the documents containing that
keyword, which are the result of retrieval, are displayed
on a user-interface window for information retrieval 521,
and topic words selected out of the document set are
displayed on a window for displaying topic words 522. First,
words are selected in the topic words extraction routine
5441 by the method of Reference 1. Although the words
selected here include, as stated earlier, common words such
as "suru" and "kono" (this), the displaying of highly
frequent stop-words can be suppressed by checking the
representativeness of words according to the
representativeness check routine 5445 and excluding words
whose representativeness values are smaller than a preset
threshold (for instance, 1.2). Furthermore, if displayed
words overlap each other under the method of Reference 1, it
is easy to display more to the front the word higher in
representativeness, or to display in a heavier tone the word
higher in representativeness, by using the
representativeness check routine 5445 in the graph mapping
routine 5443 and the graph displaying routine 5444. Thus it
is possible to display words higher in representativeness
in a more conspicuous way and thereby improve the user
interface. Furthermore, while the foregoing description
suggested calculation of the representativeness of each
word in advance according to the program shown in Fig. 4,
it is also possible to regard each set of documents obtained
for each input keyword as the set of whole documents anew,
and to calculate, according to the program shown in Fig. 4,
the representativeness of each word contained in the
documents which are the result of retrieval, as the
retrieval occurs. If the representativeness check routine
5445 is so designed, the representativeness of the same word
may differ with the keyword, and accordingly it will be
possible to display topic words in a manner reflecting the
retrieval situation more appropriately.

Fig. 6 shows an example of configuration for use in
applying representativeness to automatic word extraction.
In the figure, 601 denotes a storage for storing document
data, various programs and so forth using a hard disk or the
like. It is also utilized as a working area for programs.
Further, 6011 denotes document data (although Japanese
is used in the following example, this method is not
language-specific); 6012, a morphological analysis program
for identifying words constituting a document (it performs
such processing as word separation by spaces and
part-of-speech tagging in Japanese, or stemming in English; no
particular method is specified, and various systems have
been disclosed for both languages, whether for commercial
use or research purposes); 6013, a word-document association
program (for checking, according to the results of
morphological analysis, which word occurs in which document
and how often, or conversely which words occur how many
times in which document; basically this is a task to fill
the elements of a matrix having words as rows and documents
as columns by counting, and no particular method is
specified for this task); 6014, a word-document association
database (DB) for recording word-document association data
calculated as described above; 6015, an extracted-word
storing DB; 6017, a representativeness calculation program,
whose details are shown in Fig. 4; 6018, a representativeness
DB for recording the calculated representativeness of terms;
6019, an area for a plurality of programs to reference data
in a shared manner; 601A, a program to select the words or
word sequences which will become the candidates for
extraction (though the contents are not specified, words
such as particles, auxiliary verbs and affixes are usually
excluded from a given result of document morphological
analysis); 601B, a filter for utilizing grammatical
knowledge to exclude word sequences unsuitable as terms out
of the candidates selected by 601A (for instance, sequences
in which a case affix or an auxiliary verb comes first or
last are excluded; though the contents are not specified, a
number of examples are mentioned in the paper cited as
Reference 2). The candidates selected by 601B undergo the
calculation of importance by 601C according to a specific
measure and, those lower than a preset level of that measure
being excluded, are sorted according to importance and
outputted. While this is called the tf-idf filter program
after the name of the most frequently used measure, the
measure actually used may be any appropriate measure other
than tf-idf. Reference numeral 6016 denotes a working area;
602, an input device; 603, a communication device; 604, a
main memory; 605, a CPU; and 606, a terminal device
consisting of a display, a keyboard and so forth. The usual
word extraction method uses neither 6017 nor 6018. In
response to the output of 601C, the representativeness of
each candidate is referenced by 6017 and 6018, and those
whose measures are lower than a preset level (for instance
1.2) are excluded. A conceivable variation would use 6017
and 6018 in 601C to directly reference the
representativeness of each candidate, and select the
candidate terms according to representativeness as the sole
criterion.
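
The two-stage selection described for 601C followed by the representativeness check can be sketched as follows. The function name select_terms, its parameters and the example values are assumptions made for illustration; the importance scores stand for whatever measure 601C uses (tf-idf or another), and the representativeness values are assumed to have been calculated beforehand.

```python
def select_terms(candidates, importance, rep, min_importance, min_rep=1.2):
    """Filter and sort candidates by importance, then drop low-Rep terms.

    candidates: candidate terms that passed the grammatical filter
    importance: term -> importance score from the first-stage measure
    rep: term -> representativeness value
    """
    # Stage 1 (601C): exclude terms below the preset level, sort by importance.
    kept = [t for t in candidates if importance[t] >= min_importance]
    kept.sort(key=lambda t: importance[t], reverse=True)
    # Stage 2: exclude terms whose representativeness is below the preset level.
    return [t for t in kept if rep[t] >= min_rep]

# Hypothetical candidates and scores.
candidates = ["neural network", "do", "paper", "inference"]
importance = {"neural network": 5.0, "do": 9.0, "paper": 0.5, "inference": 3.0}
rep = {"neural network": 2.1, "do": 0.9, "paper": 1.0, "inference": 1.4}
selected = select_terms(candidates, importance, rep, min_importance=1.0)
```

Here "do" survives the first stage because of its high frequency-driven score but is removed by the representativeness check, which is the behavior the second stage is meant to provide.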

An experiment was carried out using the automatic word
extraction method of the configuration illustrated in Fig.
6, and terms were extracted from the abstracts of 1,870
papers on artificial intelligence. About 18,000 term
candidates were extracted by 601A and 601B. Two procedures
were tested: in one procedure only representativeness was
used, and in the other term candidates were first sorted by
tf-idf and the output of sorting was cleared of unimportant
words by using representativeness. The two procedures
equally gave about 5,000 term candidates, but the latter
tended to extract terms in a sequence close to the order of
frequency, so that when final selection is made by human
judgment, the latter may be more natural in a way because
familiar words come relatively early.

By using representativeness as proposed in the
present invention, there is provided a representativeness
calculation which, with respect to terms in a document set,
(1) gives a clear mathematical meaning, (2) permits
comparison of high-frequency terms and low-frequency terms,
(3) makes possible setting of a threshold value in a
systematic way, and (4) is applicable to terms containing
any number of words. Thus a method to calculate the
importance of words or word sequences can be realized, which
would prove useful in improving the accuracy of
information retrieval interfaces and word extraction
systems.

While the invention has been particularly shown and
described with reference to preferred embodiments thereof,
it will be understood by those skilled in the art that the
foregoing and other changes in form and details can be made



